Abstract: A person is considered influential when his or her behavior
triggers reactions from other people. This phenomenon is called user
influence in social networks. Measuring user influence provides insight
into the dynamics of social network interactions and is a fundamental
step in building marketing strategies, recommendation systems, and
similar applications. Although various studies have addressed this
problem, a satisfactory method for measuring user influence is still
lacking. It is also worth studying which user properties and past
activities contribute to the influence of each user. In this paper, we
investigate the attributes of more than one million social network
users and the content of their messages in order to better predict user
influence. The users and messages are from Sina Weibo, one of the most
popular social networks in China. Our first contribution is a new
approach to quantifying the influence of individuals within a period of
time; with it, we find that the influence of most users changes over
time, but that most changes are not significant. Our second
contribution is a phrase merging algorithm for obtaining high-quality
phrases, which are very helpful for extracting the topics that each
user is interested in. Our third contribution is to predict the
influence of each user with higher precision.
I. INTRODUCTION

As a social media platform [1], Sina Weibo allows people to post brief
messages to air their opinions, and their followers can forward or
comment on these posts. The platform lets opinions spread faster than
traditional media, and it has therefore gained huge popularity in China
[2]. Owing to the data richness and public availability of
microblogging systems, researchers have recently become interested in
studying user influence in social networks. The question has practical
significance. For example, it helps marketers target the most
influential individuals when delivering ads.

Some work has been done on the problem of measuring user influence. In
Cha's study [3], user influence in social media was computed from the
number of followers, retweets, and mentions gained on Twitter. A
similar metric proposed by Anger et al. [4] measured user influence by
the ratio of the number of followers to the number of friends. The
underlying assumption is that a person who has more fans and pays less
attention to others is influential. The ratio is therefore a little
better than the former method, but it is still imprecise. To measure
the social networking potential of microblogging users, other
researchers have quantified user influence from the perspective of
network structure. Among them, Brown et al. [5] used a modified K-shell
algorithm to measure the influence of each user on Twitter. Silva et
al. [6] proposed the ProfileRank algorithm for identifying influential
users. Considering both the topical similarity between Twitter users
and the network structure, Weng et al. [7] measured influence with an
extension of the PageRank algorithm. However, it is well known that
some followers merely receive messages and may not read them, let alone
be influenced by them. Measures that rely on network metrics without
distinguishing followers' properties therefore do not accurately
capture the essence of influence. None of the various approaches [8, 9,
10] currently proposed to quantify user influence is an obvious best
choice. The work of Bakshy et al. [11, 12] has inspired our study.

Building on previous studies, we propose a new way of quantifying user
influence that considers both the quantity of messages posted and their
popularity. The proposed method is evaluated on data from Sina Weibo.
Under this definition, our study reveals that the individual influence
of most users changes over time, but most changes are not significant.
This conclusion is in line with Akritidis's conclusion [13]. In our
opinion, predicting one's future influence is more meaningful than
ex-post analysis. We believe that user influence depends on user
properties and past activities, which we divide into two distinct types
of features: statistical features and topic features. All the
statistical features used in our work are easy to calculate and to
understand. We place special emphasis on extracting the topic feature
with a novel phrase merging algorithm. The extracted topic information
is used to improve the performance of the predictive models, so that
the influence of each user is predicted more accurately.

The rest of the paper is organized as follows. Section 2 describes the
real-world dataset prepared for our study. In Section 3, we propose a
novel approach to measuring the influence of every user in microblogs.
In Section 4, we describe in detail two types of features: statistical
features and topic features. In Section 5, we present our experiments
and analyze the results. Conclusions and future work are given in
Section 6.

II.  Sina  Weibo  Dataset 

To prepare for the experiments, we collected a large amount of data
from Sina Weibo over the two-year period from April 1, 2013 to March
31, 2015. For each user, we can obtain the screen name, favorites,
description, gender, registration date, number of followers, number of
friends, and so on. For each message, we can obtain the author, the
content, the post time, whether it contains pictures, the number of
times it was reposted, and the number of replies to it.

Since a user's own posts represent his or her viewpoint, we restrict
the study to the seed content of each user, that is, messages that are
not reposted from others. In total, our dataset contains 1,262,518
users and their 114,286,565 seed messages. The total number of reposts
of and replies to a message is regarded as the number of reactions to
it, because reposting and replying are both responses to the message.
As Table I shows, each user published 92.643 seed messages on average
over these two years. The average number of reactions per message is
5.537 and the median is 1, implying that most messages attract little
attention.


TABLE I. Statistics of Users and Their Seed Messages

Statistics   Number of posts per user   Number of reactions per post
Minimum      3                          0
Maximum      2000                       1726182
Median       31                         1
Mean         92.643                     5.537
StdDev       202.315                    1203.825

Figure 1. Distribution of posts with different reactions. The
horizontal axis represents the logarithm of the number of reactions per
message and the vertical axis represents the logarithm of the number of
posts.


The distribution of posts by number of reactions is shown in Figure 1.
It approximately follows a power-law distribution, which matches what
is expected of activity on social media. Figure 1 shows that only a
small portion of the messages receive thousands of reactions.

III. Computing User Influence
This section introduces a new way to quantify user influence. Using
this approach, we draw some meaningful conclusions by observing how the
individual influence of each user changes within a period of time.

A.  A Novel  Approach  for  Measuring  User  Influence 

In this paper, user influence is defined as one's ability to make his
or her followers react to his or her messages. Measuring user influence
is broken down into two steps. First, we quantify the influence score
of a message. When influenced by a message, people usually forward it
or reply to it. Both behaviors are marks of the occurrence of
influence, so the total number of reposts of and replies to a message
is taken as its influence score. Second, we follow the definition of
the h-index [14] to compute the influence score of each user.
Analogously, a user's influence score is the largest number h such that
the user has posted h messages within a period of time, each of which
has an influence score of at least h. The higher the score, the greater
the user's influence. Here the period is set to one year, because one
year is a complete cycle and a shorter period is too short to compute
the influence score of each user effectively with this definition.
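The h-index-style definition can also be stated compactly. The notation below is ours and is introduced only for illustration: $M_u$ denotes the set of seed messages posted by user $u$ within the period, $f(m)$ is the number of reactions (reposts plus replies) to message $m$, and $H(u)$ is the resulting influence score.

```latex
H(u) = \max \bigl\{\, h \in \{0, 1, 2, \dots\} \;:\; \bigl|\{\, m \in M_u : f(m) \ge h \,\}\bigr| \ge h \,\bigr\}
```

Under this reading, a user with no message that received any reaction has $H(u) = 0$, and $H(u)$ can never exceed the number of messages posted in the period.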

As the example in Table II shows, f is the function that gives the
count of reactions for each post. If a user has five posts A, B, C, D,
and E with 13, 8, 7, 5, and 1 reactions, respectively, the influence
score is 4, because the fourth post has 5 reactions and the fifth has
only 1. Similarly, if the five posts have 12, 9, 3, 1, and 0 reactions,
the influence score is 3, because the fourth post has only 1 reaction.


TABLE II. An Example of Computing User Influence Score with Our Approach

f(A)   f(B)   f(C)   f(D)   f(E)   User influence score
13     8      7      5      1      4
12     9      3      1      0      3
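The two-step computation can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in our experiments; the function names `message_score` and `influence_score` are our own, and the input is assumed to be the reaction counts of one user's seed messages within one year.

```python
def message_score(reposts, replies):
    """Influence score of one message: reposts plus replies."""
    return reposts + replies


def influence_score(reaction_counts):
    """h-index-style score: the largest h such that h messages
    each have at least h reactions."""
    ranked = sorted(reaction_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` messages all have >= rank reactions
        else:
            break         # counts only decrease from here on
    return h


# The two users of Table II.
print(influence_score([13, 8, 7, 5, 1]))  # 4
print(influence_score([12, 9, 3, 1, 0]))  # 3
```

Sorting dominates the cost, so the score of a user with k messages takes O(k log k) time.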

This definition is designed to improve upon simpler measures such as
the number of posts or the number of reactions. It reflects both the
number of posts and the number of reactions per message. It offers a
more reasonable quantitative method of determining user influence and
gives us a deeper understanding of the concept of user influence in
social media.

B.  User  Influence  Changes  over  Time 

Using the definition above, we compare the first-year influence score
with the second-year influence score of each user. Surprisingly, we
find that the influence score of 89.2% of the users changed. We further
want to explore how much the scores changed as a whole. Therefore,
cosine similarity is used to assess the overall change. It is defined
as follows:

$$ sim = 0.5 + 0.5 \times \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{1} $$

where $x_i$ and $y_i$ are the first-year and second-year influence
scores of user $i$, respectively, $n$ is the number of users, and the
cosine similarity value is normalized into the range [0, 1].

Since the value (sim = 0.8015) is close to 1, we conclude that the
individual influence of most users changes over time, but most changes
are not significant.

IV.  Features 

Feature selection is an important process that directly affects the
performance of predictive models. The features we explore are divided
into two distinct categories: statistical features and topic features.
Statistical features are user attributes of the author and various
statistics about the content of seed messages, whereas the topic
feature is a probability distribution over the topics that each user is
interested in. The components of each category are detailed in the
following parts.

A.  Statistical  Feature 

The statistical features are easy to calculate and to understand, and
they are frequently used in research on social media [15, 16]. For
clarity, the 19 statistical features we use are further divided into
two parts: 13 statistical features about user properties and 6
statistical features about the content.

On the one hand, following related work [17, 18, 19], we use the
following statistics as the statistical features of user properties:
number of followers, number of friends, number of microblogs, number of
favorites, number of bi-followers, number of important friends, whether
the profile contains a URL, verification, length of screen name, length
of description, date of joining, gender, and the follower-to-friend
ratio. Among these features, the number of microblogs is the number of
messages posted, including seed messages and reposted messages. The
number of favorites is the number of other people's microblogs that the
user has collected. Bi-followers are pairs of users who follow each
other. Important friends are the people to whom the user pays special
attention. URL is a binary variable that indicates whether there is a
blog address in the personal profile. Verification means that the user
has been certified by Sina Weibo. The follower-to-friend ratio is a
combined statistical feature: the ratio of the number of followers to
the number of friends.

On the other hand, we take the number of hashtags, links, pictures,
mentions, and words in a microblog, together with the length of the
microblog, as the statistical features about the content. The number of
words and the length of a message are different: the former is the
number of actual words and the latter is the number of characters.
These features are defined per message rather than per user. For each
user, we therefore take the mean of each feature over all of his or her
messages in a period of time as the user's statistical features.
Because these statistical features are accessible and effective,
researchers often use them to study the content of microblogs [20, 21,
22]. They are also used in our follow-up experiments.
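Averaging the per-message content features into per-user features can be sketched as below. The dictionary keys and the function name `user_content_features` are hypothetical names chosen for this illustration; the way each count is extracted from raw message text is not shown.

```python
from statistics import mean

# Hypothetical keys for the six per-message content features.
CONTENT_FEATURES = ("hashtags", "links", "pictures", "mentions", "words", "length")


def user_content_features(messages):
    """Average each per-message feature over all messages of one user.

    `messages` is a list of dicts, one per seed message, each holding
    the six counts named in CONTENT_FEATURES.
    """
    if not messages:
        raise ValueError("a user needs at least one message")
    return {name: mean(m[name] for m in messages) for name in CONTENT_FEATURES}


# Two made-up messages of one user.
messages = [
    {"hashtags": 1, "links": 0, "pictures": 1, "mentions": 2, "words": 12, "length": 30},
    {"hashtags": 3, "links": 1, "pictures": 0, "mentions": 0, "words": 20, "length": 50},
]
print(user_content_features(messages))
```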

B.  Topic  Feature 

To extract the topic feature, we need to integrate the messages of each
microblogging user. Since we aim to understand the topics that each
user is concerned about rather than the topic of each single microblog,
all the microblogs of each user are aggregated into one big document.
Each document thus corresponds to a user.

The process of segmenting a document into a 'bag-of-words' is broken
down into two steps. First, the microblogs in our dataset are mainly
written in Chinese, so we remove non-Chinese characters from the
content, as they do not help in topic modeling. Since Chinese word
segmentation differs from that of English, we segment Chinese words
with Ansj [23], one of the most popular Chinese word segmentation
tools. A document is thereby segmented into a collection of words.
Second, the frequency of each word is counted after stop words are
removed from the collection, and a minimum support is set to remove
low-frequency words. The resulting 'bag-of-words' provides a new
representation of each document.

1) Phrase Merging Algorithm: The main novelty in extracting the topic
feature is the way we transform the original 'bag-of-words' above into
a high-quality 'bag-of-phrases'.

Based on statistical analysis of word occurrences, we consider a null
hypothesis that the documents are generated from a series of
independent Bernoulli trials [24]. Under this hypothesis, we provide a
new quantitative measure of which two adjacent words form the best
collocation when merged. All words to be merged must be frequent, which
is the primary condition. The significance score is measured by
comparing the frequency of two adjacent words occurring together with
the occurrence count of each word independently, as in the following
definition:

$$ S(p_i \oplus p_j) = \min[\,\ldots\,] \tag{2} $$

where g is the function that gives the actual number of occurrences of
a word in all the documents, $p_i$ and $p_j$ are two different words,
and $p_i \oplus p_j$ indicates the consecutive co-occurrence of the two
words. Equation 2 computes the significance value as a robust
collocation measure for selecting two adjacent words to merge. A high
score indicates a strong belief that the two words are highly
associated and should be merged.

On the basis of this quantitative measure, our phrase merging algorithm
is illustrated in Figure 2. It is a bottom-up iterative algorithm. At
each iteration, the contiguous pair with the highest score (greater
than or equal to the threshold value set in advance) is selected and
merged. The newly merged phrase is treated as a single unit at the next
iteration. The algorithm terminates when the next merge with the
highest score does not meet the threshold or when all the words have
been merged into a single unit.


TABLE III. A Sample of LDA Output Result

Topic1      Topic2      Topic3      Topic4
0.0067069   0.0053117   0.0055864   0.3058648
0.0085345   0.0104205   0.0099841   0.0071137

Figure 2. Phrase merging of the 'bag-of-words' of a seed message that
is translated into English ("Until recently virtual reality technology
develops rapidly").

Since our phrase merging algorithm is purely data-driven, it has two
main advantages. First, it can be implemented without any prior domain
knowledge or complicated language rules. Second, the more data there
is, the better the results, which makes it suitable for big data
analysis.
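The bottom-up loop described above can be sketched in Python. Several points are assumptions of this sketch and are not taken from our implementation: the `score` function is only a hypothetical stand-in for the significance measure (the smaller of the two ratios of the pair count to each word count), counts are recomputed over the whole corpus at every iteration, and merged units are joined with a space.

```python
from collections import Counter


def score(pair_count, left_count, right_count):
    # ASSUMPTION: placeholder significance measure, not the paper's Equation 2.
    return min(pair_count / left_count, pair_count / right_count)


def merge_pair(tokens, left, right):
    """Replace each adjacent (left, right) occurrence with one merged unit."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + " " + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def merge_phrases(documents, min_count, threshold):
    """Repeatedly merge the best-scoring adjacent pair of frequent units."""
    documents = [list(tokens) for tokens in documents]
    while True:
        units = Counter(t for tokens in documents for t in tokens)
        pairs = Counter(p for tokens in documents for p in zip(tokens, tokens[1:]))
        best, best_score = None, None
        for (left, right), n in pairs.items():
            if units[left] < min_count or units[right] < min_count:
                continue  # both units must be frequent
            s = score(n, units[left], units[right])
            if s >= threshold and (best_score is None or s > best_score):
                best, best_score = (left, right), s
        if best is None:
            return documents  # no remaining pair meets the threshold
        documents = [merge_pair(tokens, *best) for tokens in documents]


docs = [
    ["virtual", "reality", "technology"],
    ["virtual", "reality", "develops"],
    ["technology", "develops"],
]
print(merge_phrases(docs, min_count=2, threshold=0.9))
```

Each iteration removes at least one token from the corpus, so the loop always terminates. Recomputing all counts per iteration keeps the sketch short; an implementation for large data would update counts incrementally.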

2) Topic Distillation: The purpose of topic distillation is to
automatically identify the distribution of topics for each document,
i.e., each user. The Latent Dirichlet Allocation (LDA) model [25, 26,
27] is used to accomplish this task.

In this paper, Spark, a big data analysis platform, is used to process
the large amount of data. Spark is an offline big data processing
framework that has recently attracted a great deal of attention from
researchers [28, 29]. We use the Spark implementation of the LDA
algorithm. The LDA model takes the high-quality 'bag-of-phrases' as
input, and its result is an m x n matrix, where m is the number of
users and n is the number of topics. Each row vector is a probability
distribution over the n topics that a user is interested in. Table III
displays a sample of the LDA output. The first row of the table
indicates that the first person is interested in the fourth topic. In
contrast, the second person rarely publishes messages about any of
these four topics.
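Reading the user-by-topic matrix can be illustrated with a short helper. The function name `dominant_topics` and the cutoff `min_weight=0.1` are assumptions made for this example only; the two rows are the sample values of Table III.

```python
def dominant_topics(doc_topic_rows, min_weight):
    """For each user row, return the 1-based index of the heaviest topic,
    or None when even that topic's weight is below `min_weight`."""
    result = []
    for row in doc_topic_rows:
        top = max(range(len(row)), key=row.__getitem__)
        result.append(top + 1 if row[top] >= min_weight else None)
    return result


# The two sample rows of Table III (four of the n topics).
rows = [
    [0.0067069, 0.0053117, 0.0055864, 0.3058648],
    [0.0085345, 0.0104205, 0.0099841, 0.0071137],
]
print(dominant_topics(rows, min_weight=0.1))  # [4, None]
```

In the prediction experiments the full row vector, not only the dominant topic, serves as the topic feature; the helper is just a way to inspect the output.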

We choose the high-quality 'bag-of-phrases' rather than the original
'bag-of-words' as the input of LDA to ensure that tokens in the same
phrase are assigned to the same topic. This is why the original words
need to be merged, and it is also an innovation of our work. The topic
feature obtained in this way is used in the following experiments.


V.  Experiments 

To predict user influence, a set of experiments was conducted on
real-world datasets from Sina Weibo. First, we describe the
experimental environment and the evaluation metrics on which the
experiments are based. Second, a comparison experiment is presented to
demonstrate the performance of the predictive models, especially after
adding the topic feature.

A.  Experimental  Environment  and  Evaluation  Metrics 

In this work, all experiments were carried out using big data mining
techniques. For this purpose, a Spark cluster was built on four
high-performance servers. The training and testing datasets were both
stored on HDFS [30] (the Hadoop Distributed File System), a popular
distributed data storage platform that facilitates parallel computing.

To assess the prediction accuracy of the regression models, we use two
evaluation metrics: RMSE and $R^2$. They are given by

$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \tag{3} $$

$$ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \tag{4} $$

where $y_i$ and $\hat{y}_i$ are the actual and predicted values,
respectively, $\bar{y}$ is the mean of the actual values, and $n$ is
the number of test samples. The smaller the RMSE, the better the
accuracy; conversely, the higher $R^2$, the better the result. Under
perfect prediction, RMSE is 0 and $R^2$ equals 1.
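Both metrics are straightforward to compute. The sketch below uses plain Python for clarity and is not the Spark evaluation code of our experiments; the function names and the toy values are our own.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Toy influence scores for four hypothetical users.
actual = [4, 3, 0, 7]
predicted = [5, 3, 1, 6]
print(rmse(actual, predicted), r_squared(actual, predicted))
```

Note that `r_squared` is undefined (division by zero) when all actual values are equal, and it becomes negative when the model predicts worse than the constant mean.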

B.  Predicting  User  Influence  Using  Statistical  Features 

To predict individual influence, we compared the performance of three
popular regression models: Decision Tree, Random Forest, and Gradient
Boosting. Both Random Forest and Gradient Boosting are algorithms for
learning ensembles of trees, but their training processes differ [31].
Because of the scale of the data, we used the Spark implementations of
these algorithms.

In the experiments, we aggregated all seed posts of each user and
computed the individual-level influence of each user. Each user's
second-year influence score is used as the label, and 20 features are
input into our models: 13 statistical features about user attributes, 6
statistical features about the content posted in the first year, and
the first-year influence score. We randomly selected 80% of the users
as the training dataset and used the remaining 20% as the testing
dataset. For the latter, we compared the predicted influence scores
with the actual influence scores computed from the second-year data.
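The random 80/20 split of users can be sketched as follows. The function name `split_users` and the fixed seed are illustrative choices of this sketch; on a cluster the same split would be done with the facilities of the processing framework instead.

```python
import random


def split_users(user_ids, train_fraction=0.8, seed=0):
    """Shuffle the user ids and cut them into training and testing parts."""
    shuffled = list(user_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_users, test_users = split_users(range(10))
print(len(train_users), len(test_users))  # 8 2
```

Splitting by user rather than by message keeps each user's first-year features and second-year label in the same partition.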

Table IV shows the performance of the three models on the task of
predicting individual influence. Gradient Boosting performs better than
the other two models.

TABLE IV. The Performance of the Three Models with Only 19 Statistical Features and Past Influence

Predictive models    RMSE     R-squared
Random Forest        2.5432   0.5534
Decision Tree        2.4287   0.5628
Gradient Boosting    2.3819   0.5691

C. Predicting User Influence after Adding Topic Feature

Conceivably, we could predict individual influence better if we knew
the semantic attributes of the content. Therefore, we repeated the
analysis of these models after adding the topic feature.

TABLE V. The Performance of the Three Models after Adding Topic Feature

Predictive models    RMSE     R-squared
Random Forest        1.9214   0.6048
Decision Tree        1.7358   0.6152
Gradient Boosting    1.5962   0.6236

All the metrics of these models in Table V are better than those in
Table IV. In other words, the three models are improved considerably by
the addition of the topic feature. For Gradient Boosting, for example,
RMSE drops from 2.3819 to 1.5962 and R-squared rises from 0.5691 to
0.6236. Gradient Boosting is still the best of the three algorithms.
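The size of the improvement can be made explicit with a small calculation over the values reported in Tables IV and V. The helper name `relative_rmse_reduction` and the layout of the `results` dictionary are our own choices for this illustration.

```python
# (RMSE, R-squared) from Table IV ("before") and Table V ("after").
results = {
    "Random Forest": {"before": (2.5432, 0.5534), "after": (1.9214, 0.6048)},
    "Decision Tree": {"before": (2.4287, 0.5628), "after": (1.7358, 0.6152)},
    "Gradient Boosting": {"before": (2.3819, 0.5691), "after": (1.5962, 0.6236)},
}


def relative_rmse_reduction(before, after):
    """Fraction by which RMSE decreased after adding the topic feature."""
    return (before - after) / before


for model, r in results.items():
    rmse_before, r2_before = r["before"]
    rmse_after, r2_after = r["after"]
    reduction = relative_rmse_reduction(rmse_before, rmse_after)
    print(f"{model}: RMSE -{reduction:.1%}, R-squared +{r2_after - r2_before:.4f}")
```

On these numbers the RMSE reduction ranges from roughly a quarter (Random Forest) to roughly a third (Gradient Boosting).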

In summary, the topic feature is an effective feature that captures
useful information for predicting user influence. With it, the
individual influence of each user can be predicted with satisfactory
accuracy.

VI.  Conclusions 

Our work focuses on the problem of predicting the individual influence
of each user in a big data environment. In this paper, a new way of
measuring user influence is proposed. User influence is defined as the
potential of one's actions to motivate others to repost or reply to his
or her messages. This definition takes both the quantity of messages
posted and their popularity into account, and with it we can compute
the influence score of each user. We also find that the influence of
most individuals changes over time, although most changes are not
significant. Because forecasting influence in advance is more useful
than analyzing it after the fact, we analyze in detail 19 statistical
features, past influence, and the topic feature in order to predict the
future influence of each user. Moreover, the phrase merging algorithm
we propose improves the output quality of the LDA model, which makes
the extraction of the topic feature effective. Finally, three popular
regression models implemented in Spark are used to predict the
influence score of each user, and their accuracy improves after topic
information is added.